Scene Transition


Enhancing Scene Transition Awareness in Video Generation via Post-Training

Shen, Hanwen, Lu, Jiajie, Cao, Yupeng, Yang, Xiaonan

arXiv.org Artificial Intelligence

Recent advances in AI-generated video have shown strong performance on text-to-video tasks, particularly for short clips depicting a single scene. However, current models struggle to generate longer videos with coherent scene transitions, primarily because they cannot infer when a transition is needed from the prompt. Most open-source models are trained on datasets consisting of single-scene video clips, which limits their capacity to learn and respond to prompts requiring multiple scenes. Developing scene transition awareness is essential for multi-scene generation, as it allows models to identify and segment videos into distinct clips by accurately detecting transitions. To address this, we propose the Transition-Aware Video (TAV) dataset, which consists of preprocessed video clips with multiple scene transitions. Our experiment shows that post-training on the TAV dataset improves prompt-based scene transition understanding, narrows the gap between required and generated scenes, and maintains image quality.
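A key preprocessing step implied above is segmenting source videos at detected transitions. A minimal sketch of one common heuristic for that step, hard-cut detection from histogram dissimilarity between consecutive frames, is given below; the OpenCV-based approach, the 64-bin histogram, and the 0.5 correlation threshold are illustrative assumptions rather than details from the paper.

import cv2

def detect_scene_transitions(video_path: str, threshold: float = 0.5) -> list[int]:
    """Return frame indices where a hard scene cut is likely."""
    cap = cv2.VideoCapture(video_path)
    cuts, prev_hist, frame_idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Compare grayscale intensity histograms of consecutive frames.
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [64], [0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            # Low correlation between consecutive histograms suggests a cut.
            if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
                cuts.append(frame_idx)
        prev_hist, frame_idx = hist, frame_idx + 1
    cap.release()
    return cuts

Segments delimited by consecutive cut indices can then be grouped into clips spanning multiple scenes, depending on how the dataset is assembled.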


MSG score: A Comprehensive Evaluation for Multi-Scene Video Generation

Yoon, Daewon, Lee, Hyungsuk, Shin, Wonsik

arXiv.org Artificial Intelligence

This paper addresses the metrics required for generating multi-scene videos based on a continuous scenario, as opposed to traditional short video generation. Scenario-based videos require a comprehensive evaluation that considers multiple factors such as character consistency, artistic coherence, aesthetic quality, and the alignment of the generated content with the intended prompt. Additionally, unlike single images, video generation involves character movement across frames, which introduces potential issues such as distortion or unintended changes that must be effectively evaluated and corrected. In the context of probabilistic models such as diffusion models, generating the desired scene requires repeated sampling and manual selection, akin to how a film director chooses the best shots from numerous takes. We propose a score-based evaluation benchmark that automates this process, enabling a more objective and efficient assessment of these complexities. This approach allows for the generation of high-quality multi-scene videos by selecting the best outcomes based on automated scoring rather than manual inspection.
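The "director's choice" described above can be expressed as a simple selection rule: score every sampled video on each factor, combine the factor scores, and keep the best take. The sketch below is a hedged illustration of that idea, not the MSG score itself; the scorer names and weights are hypothetical placeholders.

from typing import Callable, Dict, List

def composite_score(video, scorers: Dict[str, Callable], weights: Dict[str, float]) -> float:
    """Weighted sum of per-factor scores; each scorer maps a video to a float in [0, 1]."""
    return sum(weights[name] * score_fn(video) for name, score_fn in scorers.items())

def select_best_take(samples: List, scorers: Dict[str, Callable], weights: Dict[str, float]):
    """Automate the director's choice: return the sample with the highest composite score."""
    return max(samples, key=lambda video: composite_score(video, scorers, weights))

# Hypothetical wiring (names and weights are assumptions, not values from the paper):
# scorers = {"character_consistency": cc_fn, "aesthetic": aes_fn, "prompt_alignment": align_fn}
# weights = {"character_consistency": 0.4, "aesthetic": 0.3, "prompt_alignment": 0.3}
# best = select_best_take(generated_samples, scorers, weights)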


Ferreira

AAAI Conferences

This project aims to compose background music in real time for tabletop role-playing games. To accomplish this goal, we propose a system called MTG that listens to players' speech in order to recognize the context of the current scene and generate background music to match it. A speech recognition system transcribes the players' speech to text, and a supervised learning algorithm detects when scene transitions take place. In its current version, a scene transition occurs whenever the emotional state of the narrative changes. Moreover, the background music is not generated but selected, based on its emotion, from a library of hand-authored pieces. As future work, we plan to generate the background music considering the current scene context and the probability of a scene transition. We also plan to retrieve more information from the narrative to detect scene transitions, such as the scene's location and time of day, as well as actions taken by characters.
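The control loop described above can be summarized in a few lines: classify the emotion of each transcribed utterance, treat a change in emotion as a scene transition, and select the matching hand-authored track. The sketch below is an illustrative reconstruction under those assumptions, not the MTG implementation; classify_emotion, music_library, and play are hypothetical placeholders.

from typing import Callable, Dict, Optional

def run_session(transcripts, classify_emotion: Callable[[str], str],
                music_library: Dict[str, str], play: Callable[[str], None]) -> None:
    """Switch background music whenever the narrative's emotional state changes."""
    current_emotion: Optional[str] = None
    for utterance in transcripts:              # text from the speech recognizer
        emotion = classify_emotion(utterance)  # e.g. "tense", "joyful", "calm"
        if emotion != current_emotion:         # emotion change => scene transition
            current_emotion = emotion
            play(music_library[emotion])       # piece hand-authored for this emotion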